The travelling salesman problem (TSP) is a well-researched problem in computer science with many
practical applications. It is NP-hard, so no polynomial-time algorithm for its exact solution exists
unless P = NP, and the known exact algorithms run in exponential time. Among the variants of the TSP,
this paper addresses the asymmetric travelling salesman problem (ATSP), since this variant is often
observed in real-world scenarios. A number of heuristic approaches provide approximate solutions in
polynomial time; this paper instead proposes an exact optimal solution that is accelerated with
multi-threading-based parallelization. To find the exact optimal solution, we use the Held-Karp
algorithm, which is based on dynamic programming, and to reduce the time taken to find the optimal
path, we use a multi-threaded approach that parallelizes the processing of sub-problems by leveraging
the cores of the central processing unit (CPU). This method is an extension of a well-researched
solution to the TSP; however, it shows that solutions to computationally intensive problems involving
sub-problems, such as the ATSP, can be accelerated with the help of modern CPUs.

Keywords: Multithreading, Parallelization
This is an open access article under the CC BY-SA license. 
1. INTRODUCTION

In the simple travelling salesperson problem (TSP), we are given an undirected graph $G = (V, E)$
and a cost $c(e) > 0$ for each edge $e \in E$, and the objective is to find a Hamiltonian cycle with the
minimum cost. A Hamiltonian cycle visits every vertex in $V$ exactly once. In this paper we address the
asymmetric travelling salesman problem (ATSP), which frequently has to be dealt with in real-world
scenarios. Let $M = (V, A)$ be a given directed graph, with vertex set $V = \{1, \dots, n\}$ and arc set
$A = \{(i, j) : i, j \in V\}$. Let $c_{ij}$ be the cost of the arc $(i, j) \in A$, with $c_{ii} = +\infty$
$(i \in V)$. A Hamiltonian circuit (tour) of $M$ is a circuit visiting each vertex of $V$ exactly once.
The objective of the ATSP is to find a Hamiltonian circuit $M^* = (V, A^*)$ of $M$ with minimum cost
$\sum_{(i,j) \in A^*} c_{ij}$.
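
As a small illustration of the definition above, the following sketch evaluates the cost of a Hamiltonian circuit on an asymmetric cost matrix. The class name `TourCost`, the method `tourCost` and the 4-vertex matrix are hypothetical and chosen only for this example; vertices are numbered from 0 rather than 1.

```java
final class TourCost {
    // Stand-in for c_ii = +infinity: a vertex has no usable arc to itself.
    static final int INF = Integer.MAX_VALUE;

    /** Sums c[i][j] over consecutive arcs of the tour, including the arc back to the start. */
    static long tourCost(int[][] c, int[] tour) {
        long total = 0;
        for (int k = 0; k < tour.length; k++) {
            int from = tour[k];
            int to = tour[(k + 1) % tour.length]; // wrap around to close the circuit
            total += c[from][to];
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical asymmetric instance: c[i][j] differs from c[j][i].
        int[][] c = {
            {INF, 2, 9, 10},
            {1, INF, 6, 4},
            {15, 7, INF, 8},
            {6, 3, 12, INF}
        };
        System.out.println(tourCost(c, new int[] {0, 1, 3, 2})); // one direction
        System.out.println(tourCost(c, new int[] {0, 2, 3, 1})); // same circuit reversed
    }
}
```

Traversing the same circuit in the opposite direction gives a different cost (33 versus 21 for this matrix), which is what distinguishes the ATSP from the symmetric TSP.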

There are different variants of the travelling salesman problem which have been addressed by
researchers earlier, and both approximate (faster) and exact (slower) solutions have been provided.
Some solutions for these variants from earlier research are as follows: i) symmetric TSP:
GPU-accelerated solution provided by Kimura et al. in [1], ii) ATSP: approximation algorithms by
decomposing directed regular multigraphs provided by Kaplan et al. in [2], iii) ATSP with time windows:
exact solution through a graph transformation provided by Albiach et al. in [3], iv) ATSP with
replenishment arcs: polyhedral results provided by Mak and Boland in [4].

Journal homepage: http://ijeecs.iaescore.com
ISSN: 2502-4752

Kimura et al. used the meet-in-the-middle (MITM) algorithm to reduce the execution time, but this
method can only be used on the symmetric TSP because it leverages the symmetry of the problem. With
MITM, Kimura et al. [1] achieved an acceleration by a factor of 1.5 when n (the number of vertices) was
odd and 1.7 when n was even. Since this paper addresses the asymmetric travelling salesman problem, we
do not use MITM; instead, we use the following techniques to reduce the processing time: i) a
multi-threaded program to utilize the central processing unit (CPU) cores, ii) a thread-safe hashmap to
store the results of the dynamic cost function.

CPU parallelization has also been achieved for other algorithms, such as ant colony optimization
(ACO) for the TSP. Ling et al. in [5] have presented an adaptive parallel ant colony optimization
(PACO) algorithm using massively parallel processors (MPPs). They also propose a method of adaptively
adjusting the time interval for information exchange according to the diversity of the solutions, to
avoid early convergence and improve the quality of results [5]. Fejzagić et al. have shown that it is
possible to efficiently parallelize metaheuristic algorithms like ACO using the Task Parallel Library [6].

Ermis et al. have investigated the acceleration obtained from CUDA using 2-opt and 3-opt local
search heuristics and explained some parallelization strategies to utilize GPU resources effectively [7].
Kaplan et al. have provided approximation algorithms for the asymmetric TSP by the decomposition of
directed regular multigraphs [2]. Experiments by Saxena et al. in [8] show that parallelization tools like
OpenMP and CUDA can significantly reduce the execution time for genetic algorithms used in solving the
TSP. Rashid in [9] presented a parallel heuristic integrating a greedy approach into a genetic algorithm
with local search using GPU acceleration.

Most of the previous work has presented either an approximate algorithm for the general TSP or an
exact algorithm without CPU parallelization for the ATSP. In this paper we present an exact algorithm
for the asymmetric TSP that utilizes CPU parallelization and a thread-safe hashmap to accelerate
execution. Alrashdan et al. have used enhanced crossover operations with their probabilities in a
genetic algorithm to create an efficient method that provides a near-optimal solution for the ATSP [10].
A two-way parallel slime mold algorithm by flow and distance (TPSMA) is proposed by Liu et al. in [11]
to address the slime mold algorithm's poor local optimization. Ascheuer et al. have provided a
computational study indicating that most instances of the ATSP with time windows with up to 50-70 nodes
can be solved optimally using branch and cut [12]. Kang et al. propose an effective constructive
crossover method such that a large number of genes can be evolved by exploiting the GPU's parallel
computing power, and an effective parallel approach to the genetic TSP where crossover methods cannot
easily be implemented in parallel [13]. Vasilchikov has shown that Little's algorithm also has good
potential for recursive-parallel computations and can be used with a combined approach [14]. Sample
instances for the TSP (and related problems) from various sources and of various types are provided by
TSPLIB [15]; we have also made use of the datasets provided by TSPLIB. Svensson et al. have provided a
constant-factor approximation algorithm by reduction to subtour partition cover (an easier problem
obtained when the general connectivity requirements are relaxed significantly into local connectivity
conditions) [16]. Azimi et al. have presented a new model using simulated annealing with multiple
transporters for the TSP [17]. A new hybrid algorithm for the probabilistic traveling salesman problem
(PTSP) is proposed by Marinakis based on the greedy randomized adaptive search procedure (GRASP),
particle swarm optimization (PSO) and an expanding neighborhood search (ENS) strategy [18].

Han et al. have solved the large-scale colored travelling salesman problem using an improved ant
colony optimization (IACO) algorithm in [19]. Eremeev et al. have verified in [20] the usefulness of
parallel adaptive ant colony communities for the dynamic travelling salesman problem (DTSP). Eremeev et
al. have proposed a new memetic algorithm for the asymmetric travelling salesman problem (ATSP) with
optimal recombination in [20]. Rashid and Mosteiro have provided a novel solution in [21] that
integrates local-search heuristics, a greedy algorithm and a genetic algorithm. Odili et al. in [22]
present a comparative performance analysis of several metaheuristic algorithms for solving the ATSP:
improved extremal optimization (IEO), the African buffalo optimization algorithm (ABO), the max-min ant
system (MMAS), the heuristic randomized insertion algorithm (RAI) and the cooperative genetic ant system
(CGAS). Fosin et al. have presented a new parallel iterated local search approach in [23] with 2-opt and
3-opt operators for the symmetric TSP, using GPU acceleration. Li et al. have provided an improved
multicore-based parallel branch and bound algorithm to solve the classic TSP, along with its
shortcomings, in [24]. Rico-Garcia et al. have provided a parallel implementation of the discrete
teaching-learning-based optimization (DTLBO) algorithm in a multicore GPU environment, to improve the
performance of the algorithm and to obtain suboptimal or optimal solutions to the travelling salesman
problem [25]. However, most of the methods in the aforementioned papers provide only an approximate
solution or do not consider the asymmetric version of the TSP with parallelization. Our method shows
that finding the optimal solution of the ATSP, and of similar computationally intensive problems with
sub-problems, can be accelerated with parallelization using modern CPUs.


2. METHOD 
2.1. Theoretical analysis 

We have used the Held-Karp algorithm on a dataset of n nodes to find an exact solution for this
node set. Before parallelizing the algorithm, we need to perform the theoretical analysis of the
standard algorithm. The time and space complexity can be calculated as follows.


Time complexity: Let the given set of nodes be $V = \{v_1, v_2, \dots, v_n\}$ with $v_1$ as the
initial node. For every other node $v_i$ such that $i \neq 1$, the aim is to find the minimum cost path
with $v_1$ as the starting node and $v_i$ as the ending node such that all other nodes are visited
exactly once. For a set of size $k$, we consider $k-2$ subsets each of size $k-1$ such that none of the
subsets contains the $k^{th}$ node.

Thus, by evaluating the sum of the minimum cost path for each subset of $k-1$ nodes starting with the
initial node we get the time complexity. This is given by (1):

$n - 1 + \sum_{k=2}^{n-1} k(k-1)\binom{n-1}{k}$    (1)

Also, the number of computations for the next phase is given by (2):

noken ai (2) 
Thus, from (1) and (2), (1) reduces to (3):

$(n-1)(n-2)2^{n-3} + (n-1)$    (3)

On further reduction we get the time complexity as $O(2^n n^2)$.
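
The step from the sum over subset sizes to the closed form can be checked with a standard binomial identity. The following worked derivation is a sketch under the assumption that the sum in question is $\sum_{k=2}^{n-1} k(k-1)\binom{n-1}{k}$; the shorthand $m = n - 1$ is introduced here only for readability.

```latex
\begin{align*}
k(k-1)\binom{m}{k} &= m(m-1)\binom{m-2}{k-2}, \qquad m = n-1,\\
\sum_{k=2}^{m} k(k-1)\binom{m}{k}
  &= m(m-1)\sum_{j=0}^{m-2}\binom{m-2}{j}
   = m(m-1)\,2^{m-2}
   = (n-1)(n-2)\,2^{n-3},\\
(n-1)(n-2)\,2^{n-3} + (n-1) &\le n^2\,2^{n}
  \quad\Longrightarrow\quad O(2^n n^2).
\end{align*}
```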

Space complexity: The Held-Karp algorithm runs in exponential time but is still considerably
faster than exhaustive enumeration. This speed comes at the cost of using much more space than
exhaustive enumeration. The space complexity is given by (4):

$n - 1 + \sum_{k=2}^{n-1} k\binom{n-1}{k} = (n-1)2^{n-2}$    (4)

On reduction we get the space complexity as $O(2^n n)$.
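
As a numerical sanity check, the sketch below evaluates the time and space sums directly and compares them with the closed forms $(n-1)(n-2)2^{n-3} + (n-1)$ and $(n-1)2^{n-2}$. It assumes both sums run over $k = 2, \dots, n-1$ with binomial coefficient $\binom{n-1}{k}$; the class and method names are hypothetical.

```java
final class ComplexityCheck {
    /** Binomial coefficient C(m, k), exact for the small arguments used here. */
    static long binom(int m, int k) {
        long r = 1;
        for (int i = 1; i <= k; i++) {
            r = r * (m - k + i) / i; // after this step r equals C(m-k+i, i), so the division is exact
        }
        return r;
    }

    /** (n - 1) + sum over k = 2..n-1 of k(k-1) * C(n-1, k). */
    static long timeSum(int n) {
        long s = n - 1;
        for (int k = 2; k <= n - 1; k++) {
            s += (long) k * (k - 1) * binom(n - 1, k);
        }
        return s;
    }

    /** (n - 1) + sum over k = 2..n-1 of k * C(n-1, k). */
    static long spaceSum(int n) {
        long s = n - 1;
        for (int k = 2; k <= n - 1; k++) {
            s += (long) k * binom(n - 1, k);
        }
        return s;
    }

    /** Closed form (n-1)(n-2) * 2^(n-3) + (n-1), for n >= 3. */
    static long timeClosed(int n) {
        return (long) (n - 1) * (n - 2) * (1L << (n - 3)) + (n - 1);
    }

    /** Closed form (n-1) * 2^(n-2), for n >= 2. */
    static long spaceClosed(int n) {
        return (long) (n - 1) * (1L << (n - 2));
    }

    public static void main(String[] args) {
        for (int n = 18; n <= 22; n++) {
            boolean match = timeSum(n) == timeClosed(n) && spaceSum(n) == spaceClosed(n);
            System.out.println("n=" + n + " time=" + timeClosed(n)
                    + " space=" + spaceClosed(n) + " match=" + match);
        }
    }
}
```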


2.2. Processing architecture 

We have utilized CPU parallelization to achieve faster execution. The architecture of a CPU with
multiple cores is represented in Figure 1. A multi-core processor is a processor that contains multiple
cores, or processing units, on the same chip. This kind of processor is different from a superscalar
processor, which contains multiple execution units and can issue multiple instructions per clock cycle
from one instruction stream (thread). A multi-core processor, in contrast, can issue multiple
instructions per clock cycle from multiple instruction streams. Every core in a multi-core processor
can potentially be superscalar too, meaning that on each clock cycle each core can issue multiple
instructions from a single thread. Simultaneous multithreading (Intel's Hyper-Threading Technology is
an example) was an early form of pseudo-multi-core architecture. A processor capable of simultaneous
multithreading includes multiple execution units in the same processing unit, that is, it has a
superscalar architecture, and can issue more than one instruction per clock cycle from multiple
threads. Temporal multithreading, on the other hand, includes a single execution unit in the same
processing unit and can issue one instruction at a time from multiple threads.
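
In Java, which is used for the implementation described later in this section, the number of logical processors visible to the program can be queried at run time and used to size a thread pool. The following is a minimal sketch; the class name `CorePool` and the helper `runOnAllCores` are hypothetical, and on CPUs with simultaneous multithreading the reported count may exceed the number of physical cores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class CorePool {
    /** Runs one small task per available processor and returns how many tasks completed. */
    static int runOnAllCores() {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Future<String>> results = new ArrayList<>();
            for (int i = 0; i < cores; i++) {
                final int id = i;
                // The lambda returns a value, so it is submitted as a Callable<String>.
                results.add(pool.submit(() -> "task " + id + " ran on " + Thread.currentThread().getName()));
            }
            for (Future<String> f : results) {
                System.out.println(f.get()); // get() waits for the task to finish
            }
            return results.size();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("completed: " + runOnAllCores());
    }
}
```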


2.3. Parallelization 

Parallelization is achieved by mapping the sub-problems created by the first recursive call to
threads that run in parallel. This is illustrated in Figure 2, where cost(i, N) is the cost to optimally
visit all vertices in a set with N nodes starting from node i, and adj(i, j) is the cost to travel from
node i to node j. If there are $k$ sub-problems created by the first recursive call and $t$ threads
mapped to them, then each thread $i$ will run $x_i$ sub-problems such that

$x_i = \begin{cases}
\dfrac{k}{t} & \text{if } k \text{ is divisible by } t \\
\left\lfloor \dfrac{k}{t} \right\rfloor & \text{if } k \text{ is not divisible by } t \text{ and } 1 \le i \le t-1 \\
k - (t-1)\left\lfloor \dfrac{k}{t} \right\rfloor & \text{if } k \text{ is not divisible by } t \text{ and } i = t
\end{cases}$

Figure 1. Multi core CPU architecture 


[Figure 2 shows the first-level sub-problems Cost(1, n-1) + adj(0, 1), Cost(2, n-1) + adj(0, 2), ...,
Cost(n, n-1) + adj(0, n) mapped to threads running in parallel on the CPU, with the number of threads
equal to the number of cores. Each sub-problem expands into recursive sub-problem calls, for example
Cost(2, n-2) + adj(1, 2), Cost(3, n-2) + adj(1, 3), ..., Cost(n, n-2) + adj(1, n).]

Figure 2. Thread mapping to recursive call


Indonesian J Elec Eng & Comp Sci, Vol. 25, No. 3, March 2022: 1795-1802 




The challenge was to create a common thread-safe structure to cache the intermediate results of
the recursive calls. This data structure has to be shared by all the threads. For this purpose, we have
used Java's ConcurrentHashMap with Java threads to achieve thread-safe parallelization. The
ConcurrentHashMap constructor creates a new, empty map with the specified initial capacity, load factor
and concurrency level. The implementation performs internal sizing to accommodate as many elements as
the initial capacity and tries to accommodate as many concurrently updating threads as the concurrency
level. The initial capacity and the concurrency level of ConcurrentHashMap are both set to 16 by
default. As we are parallelizing the process with at most 9 threads, we do not need to change these
parameters. Instead of using a map-wide lock, ConcurrentHashMap maintains a list of locks, 16 by
default, so that with the default settings the number of locks equals the initial capacity. Each lock
guards a single bucket of the map. This means that as many threads as the concurrency level can modify
the map at the same time, provided that each thread works on a different bucket. Hence, unlike
Hashtable, operations such as delete, create, update and read are performed without locking the entire
map. Retrieval operations generally do not block, so they may overlap with update operations. The
entire architecture is illustrated in Figure 3.


Figure 3. ConcurrentHashMap internal structure 
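
A minimal sketch of such a shared cache is shown below. It uses the three-argument `ConcurrentHashMap(int initialCapacity, float loadFactor, int concurrencyLevel)` constructor from `java.util.concurrent` with the values 16, 0.75 and 16, and lets several threads record results with `putIfAbsent`. The class name `SharedCache`, the string key format and the stored values are hypothetical and only stand in for the (subset, node) keys of the cost function.

```java
import java.util.concurrent.ConcurrentHashMap;

final class SharedCache {
    // initialCapacity = 16, loadFactor = 0.75f, concurrencyLevel = 16
    static final ConcurrentHashMap<String, Integer> CACHE = new ConcurrentHashMap<>(16, 0.75f, 16);

    /** Starts `threads` workers that all try to cache the same `keys` entries; returns the final size. */
    static int fill(int threads, int keys) {
        CACHE.clear();
        Thread[] workers = new Thread[threads];
        for (int w = 0; w < threads; w++) {
            workers[w] = new Thread(() -> {
                for (int k = 0; k < keys; k++) {
                    // putIfAbsent is atomic: only the first writer of a key stores its value
                    CACHE.putIfAbsent("subset-" + k, k * k);
                }
            });
            workers[w].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join(); // wait until every worker has finished
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return CACHE.size();
    }

    public static void main(String[] args) {
        System.out.println(fill(9, 1000)); // 9 workers writing the same 1000 keys
        System.out.println(CACHE.get("subset-12"));
    }
}
```

Because every worker writes the same keys, the final size equals the number of distinct keys regardless of how the threads interleave.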


The optional concurrency level constructor argument guides the allowed concurrency among update
operations and is used as a hint for internal sizing. The table is partitioned internally to try to
permit the indicated number of concurrent updates without contention. The actual concurrency will vary
because placement in hash tables is essentially random. The process is represented algorithmically in
Figure 4.


Algorithm 1: Parallelized Held-Karp Algorithm

Data: node set S with |S| = n, initial node v_0 ∈ S
      weights from node v_i to v_j = W(v_i, v_j)
      number of threads = t
Result: A shortest tour that visits all locations in S

Initialize thread-safe map T to store previous weights;
fun cost(S, v):
    if T contains key (S, v) then:
        return T(S, v);
    else if |S| = 1 then:
        return W(v, v_0);
    else
        Set v as a visited node;
        SubCost ← ∞;
        for each node v_i in S:
            if v_i is not visited:
                SubCost ← Min(W(v, v_i) + cost(S − {v}, v_i), SubCost);
        T(S, v) ← SubCost;
        Set v as an unvisited node;
        return SubCost;
map cost(S, v_1) : cost(S, v_{n/t}) to thread 1 → C_1
map cost(S, v_{n/t}) : cost(S, v_{2n/t}) to thread 2 → C_2
...
map till cost(S, v_{n(t−1)/t}) : cost(S, v_n) to thread t → C_t
return Min(C_1, C_2, ..., C_t)


Figure 4. Parallelized Held-Karp algorithm
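
For concreteness, a compact Java sketch of the parallel scheme in Algorithm 1 follows. It is not the authors' implementation: it assumes node 0 is the start node, encodes the set of unvisited nodes as a bit mask (so n must be at most 30), packs the pair (set, node) into a `Long` key, and splits the first-level sub-problems into contiguous blocks of nearly equal size. All class, field and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelHeldKarp {
    private final int[][] w; // w[i][j] = cost of arc i -> j (may be asymmetric); diagonal is never read
    private final int n;
    // Shared memo table: the key packs (set of still-unvisited nodes, current node).
    private final ConcurrentHashMap<Long, Integer> memo = new ConcurrentHashMap<>();

    ParallelHeldKarp(int[][] w) {
        this.w = w;
        this.n = w.length;
    }

    /** Cheapest way to leave v, visit every node in `remaining` once, and end at node 0. */
    private int cost(int remaining, int v) {
        if (remaining == 0) {
            return w[v][0];
        }
        long key = ((long) remaining << 5) | v; // v < 32, so 5 bits suffice
        Integer cached = memo.get(key);
        if (cached != null) {
            return cached;
        }
        int best = Integer.MAX_VALUE;
        for (int next = 1; next < n; next++) {
            if ((remaining & (1 << next)) != 0) {
                best = Math.min(best, w[v][next] + cost(remaining & ~(1 << next), next));
            }
        }
        memo.put(key, best); // racing threads may both compute this entry; they store the same value
        return best;
    }

    /** Solves the instance with `threads` workers, each taking a contiguous block of first moves. */
    int solve(int threads) {
        if (n == 1) {
            return 0;
        }
        int all = (1 << n) - 2; // every node except the start node 0
        int k = n - 1;          // number of first-level sub-problems
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> partial = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int lo = 1 + (int) ((long) k * t / threads);
                final int hi = 1 + (int) ((long) k * (t + 1) / threads);
                partial.add(pool.submit(() -> {
                    int best = Integer.MAX_VALUE;
                    for (int first = lo; first < hi; first++) {
                        best = Math.min(best, w[0][first] + cost(all & ~(1 << first), first));
                    }
                    return best; // stays MAX_VALUE if this worker's block is empty
                }));
            }
            int best = Integer.MAX_VALUE;
            for (Future<Integer> f : partial) {
                best = Math.min(best, f.get());
            }
            return best;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[][] w = {{0, 2, 9, 10}, {1, 0, 6, 4}, {15, 7, 0, 8}, {6, 3, 12, 0}}; // hypothetical instance
        System.out.println(new ParallelHeldKarp(w).solve(3));
    }
}
```

The sketch returns only the optimal cost; recovering the tour itself would require storing the best successor alongside each memo entry.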
3. RESULTS AND DISCUSSION

Testing the algorithm on the 17-node instance produces the optimal tour with total distance = 39,
as verified against TSPLIB. Figure 5 shows the execution time for solving the problem with the number
of nodes n ranging from 18 to 22, taken from the 34-node dataset of TSPLIB, with t threads (x-axis)
running in parallel.


[Figure 5: plot titled "Time vs Threads"; x-axis: Threads; y-axis: Time in seconds]

Figure 5. Time vs number of threads


From Figure 5 we can see that running more threads in parallel reduces the execution time
substantially in each of the cases with $18 \le n \le 22$, although the gains flatten beyond about five
threads. The same information is given in Table 1. For each node set of n nodes ($18 \le n \le 22$)
from the 34-node dataset, the speed-up ratio, which corresponds to the single-thread time divided by
the nine-thread time in Table 1, is given in Table 2.


Table 1. Performance in terms of time taken in seconds

Nodes   Threads
        1         2         3         4         5         6         7         8         9
18      14.151    7.375     5.392     4.72      4.327     3.929     3.844     3.643     3.526
19      35.468    18.996    14.222    12.629    11.912    10.594    10.496    9.76      9.516
20      86.207    46.732    35.457    30.829    28.06     27.388    26.471    26.391    24.601
21      211.379   115.009   87.717    78.567    70.265    71.04     66.224    67.849    65.61
22      501.693   279.428   218.44    200.299   182.02    187.85    180.67    182.016   172.911


Table 2. Speed-up ratio

Nodes   Speed-up ratio
18      4.013
19      3.727
20      3.504
21      3.22
22      2.9
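
The ratios in Table 2 are consistent with dividing the single-thread time by the nine-thread time from Table 1. The sketch below reproduces that calculation and also reports the parallel efficiency (speed-up divided by the number of threads), a quantity that is not reported in the tables; the class and method names are hypothetical.

```java
final class SpeedUp {
    static final int[] NODES = {18, 19, 20, 21, 22};
    // Times in seconds from Table 1 for 1 thread and for 9 threads.
    static final double[] T1 = {14.151, 35.468, 86.207, 211.379, 501.693};
    static final double[] T9 = {3.526, 9.516, 24.601, 65.61, 172.911};

    /** Speed-up ratio: single-thread time divided by multi-thread time. */
    static double speedUp(double singleThread, double multiThread) {
        return singleThread / multiThread;
    }

    /** Parallel efficiency: speed-up divided by the number of threads. */
    static double efficiency(double speedUp, int threads) {
        return speedUp / threads;
    }

    public static void main(String[] args) {
        for (int i = 0; i < NODES.length; i++) {
            double s = speedUp(T1[i], T9[i]);
            System.out.printf("n=%d speed-up=%.3f efficiency=%.2f%n", NODES[i], s, efficiency(s, 9));
        }
    }
}
```

The speed-up falls from about 4.0 at n = 18 to about 2.9 at n = 22, so with nine threads the efficiency stays below one half.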


4. CONCLUSION 

The experiment has demonstrated that the proposed parallelized algorithm for solving the ATSP
optimally reduces the execution time compared to the traditional Held-Karp algorithm and is a viable
method to compute the optimal path for the ATSP. Although the computation time is higher than that of
suboptimal methods, the proposed methodology gives the exact solution to the ATSP, which justifies the
higher computational time. Other optimal and suboptimal methods can incorporate CPU parallelization, as
in the proposed methodology, to produce even better results. In the future, hybrid algorithms can be
used along with parallelization on both GPU and CPU to solve computationally intensive problems such as
the ATSP.